We study the structure of the solution space and the behavior of local search methods on random 
3-SAT problems close to the SAT/UNSAT transition. Using the overlap measure of similarity 
between different solutions found on the same problem instance, we show that the solution space 
shrinks as a function of α. We consider chains of satisfiability problems, where clauses are added 
sequentially. In each such chain, the overlap distribution is first smooth and then develops a tiered 
structure, indicating that the solutions are found in well-separated clusters. On chains of not too 
large instances, all solutions are eventually observed to be in only one small cluster before vanishing. 
This condensation transition point is estimated to be α_c = 4.26. The transition approximately obeys 
finite-size scaling with an apparent critical exponent of about 1.7. We compare the solutions found 
by a local heuristic, ASAT, and the Survey Propagation algorithm up to α_c. 

PACS numbers: 02.70.-c, 05.40.-a, 64.60.Cn, 89.20.Ff 

I. INTRODUCTION 

Constraint satisfiability problems (CSPs) are ubiquitous in application areas such as planning, scheduling, product 
configuration, automated electronic design and more [H, . From the theoretical point of view, a problem's 
computational difficulty is determined by embedding it in a family of similar problems of increasing size. If an algorithm 
is known whose run-time grows at most polynomially in the size of the problem, the problems belong to the class P and 
are considered (relatively) easy. For CSPs that belong to the hardest class of NP-complete problems, supposedly no 
such algorithm can be found. 

The subject of this investigation is a benchmark CSP, the random K-satisfiability, or random KSAT, problem. As 
described below in section II, in KSAT M propositions in N logical variables are given, each depending on K variables, 
and the propositions, or clauses, all have to be satisfied simultaneously. There is a close relation between random 
KSAT and dilute spin glasses, where the "energy" of a configuration is the number of unsatisfied clauses. The models 
are dilute because the number of interactions each variable has with other variables is Poisson distributed with finite 
mean, i.e., generally finite (on a large enough problem), and "spin glass-like" because the clauses are random, so there 
is a large amount of frustration. This has received extensive attention from statistical physicists over more than ten 
years [ij, [l^ [2^ [2^ . The direct analogy of the random KSAT problem is then whether the spin glass model has a 
ground state of energy zero, i.e. no violated constraints. 

Deterministic algorithms to solve a CSP will definitely find a solution if there is one, and will answer in a finite 
time that no solution exists, if that is the case. A prime example is the DPLL algorithm. Simpler procedures, 
called heuristics or stochastic local search algorithms, typically find a solution considerably faster, but may fail to 
find a solution even though one exists. Several heuristics can be considered non-equilibrium relaxation processes in an 
associated spin glass model d, HH-. RandomWalksat, the algorithm introduced by Papadimitriou [27], walks in a flat 
landscape, where all non-solutions are treated equally, while Walksat [l], [s^l , Focused Metropolis Search (FMS) [2§| 
and Average SAT (ASAT) fi] are sensitive, in different ways, to the number of violated constraints. 
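
As an illustration of the simplest of these heuristics, the following is a minimal Python sketch of a RandomWalksat-style search: pick an unsatisfied clause at random and flip one of its variables at random, ignoring how many clauses are violated. The clause representation (a list of `(variable, preferred value)` pairs), the function name `random_walksat` and the `max_flips` cut-off are assumptions made for this example, not details taken from the cited implementations.

```python
import random

def random_walksat(clauses, n_vars, max_flips=100_000, seed=None):
    """Flat-landscape random walk; returns a satisfying assignment or None."""
    rng = random.Random(seed)
    x = [rng.randint(0, 1) for _ in range(n_vars)]
    for _ in range(max_flips):
        # A clause is violated when none of its variables has the preferred value.
        unsat = [cl for cl in clauses
                 if not any(x[v] == want for v, want in cl)]
        if not unsat:
            return x
        # Pick a violated clause, then one of its variables, uniformly at random.
        v, _want = rng.choice(rng.choice(unsat))
        x[v] = 1 - x[v]
    return None  # cut-off reached; this does not prove unsatisfiability

# Small hand-made 3-variable instance (hypothetical example data).
demo_clauses = [[(0, 1), (1, 1), (2, 0)],
                [(0, 0), (1, 1), (2, 1)],
                [(0, 1), (1, 0), (2, 1)]]
solution = random_walksat(demo_clauses, n_vars=3, seed=0)
```

The sketch rescans all clauses after every flip, which is fine for illustration but not for the problem sizes discussed here; practical solvers keep the list of violated clauses up to date incrementally.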

A main prediction of statistical physics on random KSAT problems has been the existence of a clustering transition 
below the SAT/UNSAT transition, a picture which has passed through several refinements. In the original version the 
set of solutions was predicted to be connected in a single cluster below a threshold which for 3SAT is at about 3.9 [7]. 
Above this threshold the set of solutions was supposed to break up into a large set of smaller clusters. This scenario 
has been proven rigorously for K ≥ 8 [2l|, although the transition value has only been determined approximately. In a 
recent contribution it has been argued that for K equal to four or higher, the previously determined clustering 
threshold is but one in a series of more complex transformations [17]. 

It has been believed that different clusters of solutions and different local minima of the number of unsatisfied 
clauses are separated by extensive barriers. This has not so far been shown rigorously or by systematic arguments 
from spin glass theory. If however this rather natural assumption were true, then one would expect that local 
heuristics have difficulties beyond the clustering transition, leaving the interval up to the SAT/UNSAT transition to 
more sophisticated non-local algorithms such as Survey Propagation [l^, [H, . Numerical experiments have however 
shown that some heuristics actually run in linear time on average well beyond all so far predicted clustering transitions 
on 3SAT [li,[2^. 

In this paper we present numerical evidence that a "cluster condensation" transition occurs in random 3SAT, 
and determine its value to be approximately 4.26. We cannot exclude that the location in fact coincides with the 
SAT/UNSAT transition at 4.267. The transition approximately obeys finite-size scaling, with an apparent critical 
exponent of about 1.7. 

This paper is structured as follows: in section II we present the KSAT problem in a little more detail, and review 
some solving techniques for satisfiability problems. In section III we investigate the clustering transition, and compare 
the solutions found by the ASAT heuristic with those found by the Survey Propagation algorithm. In section IV we 
discuss the practical and theoretical significance of our results. 

II. THE RANDOM K-SATISFIABILITY PROBLEM AND HEURISTICS 

A. The SAT problem 

The satisfiability problem (SAT) is a central problem in theoretical as well as practical computer science. It is the 
problem of assigning values to N binary variables, x ∈ {0, 1}, given M constraints. Each constraint specifies preferred 
values for K of the variables and is said to be satisfied if at least one variable equals its specification. The whole 
problem is said to be satisfied if all constraints are. 

SAT problems belong to the class of NP-complete problems, which means that no known algorithm is able to solve 
a worst case instance in time polynomial in system size. Since worst case analysis is hard, and may not be relevant 
to the typical behavior, one is often also interested in how well a certain method performs on an ensemble of random 
instances of the problem. That is, given N variables, construct M constraints each containing K randomly chosen 
variables with random specified values {0, 1}. This problem is known as random K-SAT; it is NP-complete when K 
is three or greater, and is the problem studied in this paper. 
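
The construction just described can be sketched in a few lines of Python. The representation of a constraint as a list of `(variable index, specified value)` pairs and the names `random_ksat` and `unsatisfied` are choices made for this illustration only; the sketch also assumes that the K variables within one constraint are distinct.

```python
import random

def random_ksat(n_vars, n_clauses, k=3, seed=None):
    """Draw n_clauses constraints, each on k distinct variables with random specified values."""
    rng = random.Random(seed)
    clauses = []
    for _ in range(n_clauses):
        variables = rng.sample(range(n_vars), k)
        clauses.append([(v, rng.randint(0, 1)) for v in variables])
    return clauses

def unsatisfied(clauses, x):
    """Indices of constraints in which no variable equals its specified value."""
    return [c for c, clause in enumerate(clauses)
            if not any(x[v] == want for v, want in clause)]

# N = 20 variables and M = 60 constraints, i.e. 3 constraints per variable.
clauses = random_ksat(n_vars=20, n_clauses=60, seed=1)
x = [0] * 20                            # an arbitrary trial assignment
energy = len(unsatisfied(clauses, x))   # number of violated constraints
```

The assignment `x` satisfies the whole problem exactly when `energy` is zero.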

Random K-SAT has been shown to display a phase transition when varying the ratio of the number of constraints 
to the number of variables, α = M/N [l^, [2^ . Below the transition a generic instance is with high probability (that is, 
with 100% certainty when M, N → ∞) satisfiable, and above it it is not. This fact, which extends also to other random 
CSPs such as coloring, has given the problem great attention from the theoretical physics community. Methods from 
physics have also predicted a second transition, below the SAT/UNSAT one, where the solutions begin to 
cluster into distinct regions of the state space. This divides the SAT region into an EASY and a HARD region. Lately 
even more transitions, which divide these phases even further, have been suggested [U H^l . The exact locations of 
these clustering transition points on 3SAT and details of their nature are still open questions. 

B. SAT solving techniques 

As discussed above, KSAT problems for K greater than two are NP-complete, and no algorithm with a guaranteed 
running time that does not grow very quickly in the size of the problem is likely to exist. Due to their great practical 
interest, algorithms for CSPs in NP-complete classes have nevertheless been studied extensively, see [l^ for a recent 
review. 

In parallel to deterministic algorithms (with exponential worst-case behaviour), there are well-performing 
non-deterministic algorithms, or heuristics. The currently best performing methods for random K-SAT (and some other 
satisfiability problems) are either message-passing based (Belief Propagation (BP), Survey Propagation (SP)) or local 
search methods (walksat, FMS, ASAT). The former use structured iterations of guesses (beliefs) between variables in 
order to reduce the space of possible configurations in the search. The latter use only local information together with 
some target function as a guide in configuration space. Currently, the hardest instances of random 3SAT that can be 
solved with Survey Propagation have a clause to variable ratio of 4.25 [1, [13, HSli while the best local search methods 
solve problems up to at least 4.21 [1,[2^ in linear time, on average. 

There are no a priori theoretical reasons known to us why either message-passing or local search heuristics (or both) 
cannot be pushed beyond their current bounds. Indeed, it is well known that vanilla-flavored SP can be improved by 
backtracking in the decimation step. 



III. THE OVERLAP DISTRIBUTION AND THE CHAIN METHOD 

In this section we study the overlaps between a finite number L of solutions found on one instance of the 3SAT 
problem. L will most of the time be less or much less than N, the number of variables, so that the mutual overlaps 
can be arranged in an L × L overlap matrix D, whose entry D_ij is the overlap between solutions i and j. 

The sorted set of all unique distances (self overlaps left out) gives a characteristic sample of the distribution of overlaps 
between solutions found by a particular random algorithm on this instance. The solutions are found using ASAT [j| 
with all starting configurations chosen independently and randomly. 
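
A minimal Python sketch of this bookkeeping follows. It assumes that the overlap of two solutions is the number of variables on which they agree, which is consistent with overlaps being quoted in units of N elsewhere in the text; the function names are illustrative.

```python
from itertools import combinations

def overlap(a, b):
    """Number of variables on which two 0/1 assignments agree."""
    return sum(1 for u, v in zip(a, b) if u == v)

def overlap_matrix(solutions):
    """Symmetric L x L matrix of pairwise overlaps; the diagonal holds the self overlaps N."""
    return [[overlap(a, b) for b in solutions] for a in solutions]

def sorted_overlaps(solutions):
    """Overlaps of all distinct pairs, self overlaps left out, sorted for a rank plot."""
    return sorted(overlap(a, b) for a, b in combinations(solutions, 2))

# Three toy "solutions" on N = 4 variables (hypothetical data).
solutions = [[0, 0, 1, 1], [0, 1, 1, 1], [1, 1, 0, 0]]
D = overlap_matrix(solutions)
ranked = sorted_overlaps(solutions)
```

Because the matrix is symmetric, only the L(L-1)/2 entries above the diagonal carry independent information, which is why `sorted_overlaps` iterates over unordered pairs.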

To examine how the overlap distribution changes with the number of constraints in the problem, one starts with a 
problem instance without any constraints (α = 0). The overlaps in D_{α=0} are then trivially found to 
be randomly distributed around N/2. That is, there are no correlations between different solutions. One then adds 
constraints one at a time until no solutions can be found, and thereby generates a chain of instances: 

$$C = [X_{\alpha=0}, \ldots, X_{\alpha_{\max}}]$$

Let us note that the achievable α_max depends on the algorithm, the size of the problem and the chain. We will first 
study the overlap distribution in one chain, and then compare different chains, and how their properties change with 
N. 
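
The chain construction can be sketched as a Python generator. This is only an illustration: the name `clause_chain` and the clause representation are assumptions, and the chain is cut at a caller-supplied `alpha_max` rather than at the point where the solver stops finding solutions, as is done in the experiments.

```python
import random

def clause_chain(n_vars, alpha_max, k=3, seed=None):
    """Yield (alpha, clauses) for a chain in which one random clause is added per step."""
    rng = random.Random(seed)
    n_target = round(alpha_max * n_vars)
    clauses = []
    yield 0.0, list(clauses)            # the unconstrained instance, alpha = 0
    while len(clauses) < n_target:
        variables = rng.sample(range(n_vars), k)
        clauses.append([(v, rng.randint(0, 1)) for v in variables])
        # Each instance contains all clauses of its predecessor plus one new clause.
        yield len(clauses) / n_vars, list(clauses)

chain = list(clause_chain(n_vars=10, alpha_max=1.0, seed=0))
```

Since every instance extends the previous one, any solution of a later instance in the chain is also a solution of all earlier instances.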

The first property to study is the average of the overlaps between solutions. Fig. 1(a) shows one chain with 
N = 2000 variables and L = AQ solutions found on each instance. The average clearly increases with α, first smoothly 
to about 70%, and then sharply to above 90%. Fig. 1(b) further shows the variance of the overlaps. This quantity first 
also increases, but then eventually decreases to zero. Taken together, these data imply that the overlap distribution 
starting from α = 3.7 is first fairly concentrated at a value of about two thirds, then develops a fraction of large 
overlaps, so that both the average and the variance increase, and finally all concentrates into the large overlap phase. 
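
For concreteness, the two summary statistics can be computed as in the following Python sketch. It assumes overlaps are normalized by N, so that they read as fractions, and it uses the population variance; neither convention is stated explicitly in the text, and the function name is illustrative.

```python
from itertools import combinations
from statistics import mean, pvariance

def overlap_stats(solutions):
    """Mean and population variance of the normalized pairwise overlaps."""
    n = len(solutions[0])
    q = [sum(u == v for u, v in zip(a, b)) / n
         for a, b in combinations(solutions, 2)]
    return mean(q), pvariance(q)

# Toy data: three assignments on N = 4 variables (hypothetical).
avg, var = overlap_stats([[0, 0, 1, 1], [0, 1, 1, 1], [1, 1, 0, 0]])
```

A sample that sits entirely in one tight cluster gives a mean close to one and a variance close to zero, while a mixture of close and distant pairs gives a large variance, which is the signature described above.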

A more detailed picture is obtained by plotting directly the overlap distribution in a chain. Fig. 2 shows that 
indeed the distribution starts out fairly concentrated, then grows both higher on average (solutions closer together on 
average) and also steeper (some solutions much more similar). At around 4.25 an interesting transition takes place, 
where some solutions remain far apart while others are found much closer. At the last value investigated, pairs of solutions 
are apparently either quite close (overlap above 90%) or fairly far apart (overlap around 60%). 

The overlap plot of figure 2 does not distinguish between whether there is only one cluster where all high overlap 
solutions are found or a set of clusters. Two possible scenarios that would generate a curve like the one for α = 4.3 
are: 



1. Most of the solutions are found in a small region of the state space. The rest are found randomly distributed. 



2. All solutions are found in small but separate regions. Few solutions are randomly distributed. 



FIG. 1: (a) The mean value and (b) the variance of the overlaps for one chain from α = 3.7 to α = 4.3. N = 2000 variables. 

In trying to distinguish between the one-cluster and many-cluster scenarios one can use some kind of cluster detection 
algorithm. We used a simple algorithm described in pseudo code in the following listing. 



    For a given instance generate n solutions

    for radius = 0 to N/2
        rings = 0
        for i = 1 to n
            if solution(i) is not inside a ring
                place a ring with this radius centered on solution(i)
                rings = rings + 1
            else
                solution(i) belongs to the covering ring
            end
        end
        plot rings vs. radius
    end



This will count the number of rings with radius r that are needed to cover the whole set of solutions. The results of 
the algorithm for values around the clustering transition (α = 4.3, 4.25 and 4.2) for the same instances as in Fig. 2 are 
depicted in Fig. 3. 
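
The ring-counting listing above can be sketched in Python as follows. Two details are assumptions of this sketch rather than statements from the text: distance is taken to be the Hamming distance between assignments (N minus the overlap), and a solution counts as inside a ring when its distance to the ring's center is at most the radius.

```python
def hamming(a, b):
    """Number of variables on which two 0/1 assignments differ."""
    return sum(u != v for u, v in zip(a, b))

def count_rings(solutions, radius):
    """Greedy cover: open a new ring on every solution not inside an existing ring."""
    centers = []
    for s in solutions:
        if not any(hamming(s, c) <= radius for c in centers):
            centers.append(s)
    return len(centers)

def ring_curve(solutions):
    """(radius, number of rings) for radius = 0 .. N/2."""
    n_vars = len(solutions[0])
    return [(r, count_rings(solutions, r)) for r in range(n_vars // 2 + 1)]

# Two tight, well separated toy clusters on N = 8 variables (hypothetical data).
solutions = [[0] * 8, [0] * 7 + [1], [1] * 8, [1] * 7 + [0]]
curve = ring_curve(solutions)
```

For this toy sample the number of rings drops from four to two as soon as the radius reaches the cluster size and then stays at two, the same kind of plateau that is read off as the number of clusters in the discussion of Fig. 3. The greedy cover depends on the order in which the solutions are visited, so the counts are upper estimates of the minimal number of rings.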

Comparing the results for the three values of α shows clearly that before the transition, when increasing the radius 
of the rings, a continuously growing number of solutions are covered. This indicates a (nearly) homogeneous density of 
solutions in the interval from 0.1N to 0.3N. When increasing the number of constraints the number of rings needed 
decreases rapidly around r = 0.05N and only a few (~ 10) rings with radius 0.15N are required to cover the whole set. 
In the last instance, for α = 4.3, the number of rings decreases very rapidly from ~ N to only 4 at radius 0.05N. This 
indicates that with 4 rings, each covering about 0.1N of the variables, the whole set of solutions can be covered, while 
only one ring is sufficient when the radius is ~ 0.35N. The natural interpretation is of course that in this example 
there are four different clusters with diameter 0.1N which are separated by a distance of ~ 0.3N. Altogether, the 
cluster plots favor the second scenario sketched above. These facts could not be obtained from the overlap plots alone. 

An important issue when trying to characterize a set of solutions from a finite sample is possible bias in the sampling 
method. We cannot answer this question definitely, but we can compare with other solution methods. To this end 
we have compared ASAT with the Survey Propagation (SP) method [231. [2^. SP, as a satisfiability solver, contains a 
deterministic step, where variables are eliminated incrementally based on clues from the Survey Propagation 
message-passing algorithm, which for KSAT can be viewed as Belief Propagation in a space of three values per variable (set 
0, set 1, free) [9], [19]. This deterministic step is followed by a stochastic local search, by default Selman-Kautz-Cohen 
walksat [l|, [30]. For our purposes, SP can therefore be thought of as practically a deterministic algorithm, since the 
solutions found after the stochastic local search typically have large overlap. Fig. 4 shows chains of overlaps between 
solutions found by SP and ASAT, respectively, and between two solutions found by ASAT. These figures suggest that the 
solution found by SP is typical with respect to the solutions found by ASAT, and hence that the set of solutions 
sampled by ASAT is relevant also for very different solution methods. 

FIG. 2: Rank plot of the values of the overlap matrix D for α = 3.5, 3.6, 3.7, 3.8, 3.9, 4.0, 4.1, 4.2, 4.25 and 4.3. Self-overlaps are 
left out. For low values of α, the curves are continuously distributed around one mean value. When the number of constraints 
increases the distribution tends to spread out. Increasing α even further results in a break-up of the distribution. The solutions 
found are then grouped together in clusters. The overlap of solutions in the same cluster is high (>0.9N), while the overlap 
between solutions in different clusters is lower (~0.6N). 

FIG. 3: The result of the clustering algorithm shows how many disks with a given radius are required to cover the whole set of 
solutions. The three curves show this for α = 4.2, 4.25 and 4.3. The small subplot is an enlargement of the curve for α = 4.3. 

FIG. 4: The first plot shows the overlap between a solution found by SP and one found by ASAT for α = 0 to 4.3. The other plot 
shows the overlaps between two different solutions found by ASAT on the same instance. The numbers of variables are 2000 and 
10000 in both cases. 




FIG. 5: (a) For each chain, the point in α where the smallest overlap in the overlap matrix is above 0.8N is marked for N = 100, 
200, 400, 1000 and 2000 variables. 110 chains are used for each N. The point of the jump of each chain marks one point in 
the rank plot. The mean value of the N = 2000 curve is 4.245. (b) Finite-size scaling applied to the same data. The best fit is 
achieved for ν = 1.7 and α_c = 4.26. The ruggedness of the curves for large N is due to discretization in α. 



Finally, we turn to comparing clustering in different chains. The transition from a finite number of clusters to 
only one is relatively easy to characterize since both the variance and the minimal value of the overlap matrix D 
change abruptly. Figure 5(a) shows a rank plot of the point where all solutions found have overlap more than 0.8N. 
This value gets sharper with larger N, which suggests that when N → ∞ a phase transition from several to one cluster 
takes place. Straightforward finite-size scaling of the data gives a transition at α = 4.26 with a critical exponent of 
1.7. 
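
The scaling form used for the fit is not spelled out in the text. A common ansatz, assumed here purely for illustration, is that the curves for different N collapse when plotted against the rescaled variable y = N^(1/ν) (α − α_c). The Python sketch below computes this variable with the fitted values ν = 1.7 and α_c = 4.26 as defaults; the function name `rescale` is hypothetical.

```python
def rescale(alpha, n_vars, alpha_c=4.26, nu=1.7):
    """Rescaled control parameter y = N**(1/nu) * (alpha - alpha_c) (assumed ansatz)."""
    return n_vars ** (1.0 / nu) * (alpha - alpha_c)

# The same distance from alpha_c is stretched more strongly for larger systems.
y_small = rescale(4.20, n_vars=100)
y_large = rescale(4.20, n_vars=2000)
```

Under this ansatz, a data collapse amounts to applying `rescale` to the transition points of every chain and checking that the rank plots for the different N fall on a single curve.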

We note that these values are given for the chains that could be prolonged until the clustering transition. For some 
instances the average value increases towards but does not reach 0.7N before no more solutions are found by the algorithm 
within the time cut-off used here. The number of instances that display a transition before the time cutoff is shown 
in figure 6. 



IV. CONCLUSION AND DISCUSSION 



The claim that the geometry of the set of solutions to large random Constraint Satisfaction Problems is systematically 
different depending on the structure of the problem has received wide attention in both statistical physics and 
computer science Isl. Iisj . For the random K-satisfiability problem, rigorous results exist for K ≥ 8, new predictions 
have recently been made for K ≥ 4, while the benchmark random 3SAT problem is now considered open from the 
theoretical point of view. 

FIG. 6: The fraction of instances that display a condensation transition within the time cutoff. 

We have determined the clustering structure in numerical experiments, by solving many 3SAT instances of the 
same type many times with a simple stochastic local search procedure, ASAT. This heuristic solves random 3SAT in 
linear time, on average, up to at least 4.21 clauses per variable, as we have shown previously Q. For problem sizes 
of thousands and tens of thousands of variables it can also be readily used beyond the linear regime, up to around 
the satisfiability/unsatisfiability threshold at about α_cr = 4.267 clauses per variable. 

We find that there is indeed a cluster condensation transition in the solutions found by ASAT at around 4.26, very 
close to α_cr. The transition approximately obeys finite-size scaling, with an apparent critical exponent of about 1.7. 
We find indications that the space of solutions divides into several distinct clusters in some region below the 
transition, approximately 4.21 < α < 4.26. We have also shown that the solutions found by ASAT are 
compatible with those found on the same problems by the Survey Propagation (SP) algorithm, in the sense that the 
solutions found by SP are not atypical with respect to the sets of solutions found by ASAT. 

We conjecture that the absence of any (observable) clustering transition except very close to the 
satisfiability/unsatisfiability threshold is related to the surprising effectiveness of simple heuristics in solving large and hard 
random 3SAT problems, and perhaps other random CSPs as well. 